Quick Start: Gaussian Graphical Models with Multi-Study Factor Analysis (MSFA-X)
Introduction
This quick start guide is a companion to the preprint “Gaussian graphical modeling of multi-study data with Multi-Study Factor Analysis”(Shutta, Scholtens, et al., 2022).
Installing and loading libraries
The first step is to install and load MSFA-X, which is contained in
the extended MSFA R package from the fork available at https://github.com/katehoffshutta/MSFA.
We will use the devtools package to do this, so you will
need to install devtools first if you don’t have it
already.
library(devtools)
install_github("katehoffshutta/MSFA")
library(MSFA)There are several other packages that we will use downstream to analyze the model results. Install them prior to running this chunk.
library(corrplot)
library(dplyr)
library(forcats)
library(ggplot2)
library(igraph)Loading and filtering the data
Next, we will load the breast cancer dataset used by (De
Vito et al., 2021). This dataset consists of measurements for
the same 6362 genes across seven different breast cancer samples. For
more details, run ?data_breastCancer05.
data(data_breastCancer05)
length(data_breastCancer05)## [1] 7
sapply(data_breastCancer05,dim)## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 118 353 200 99 198 133 344
## [2,] 6362 6362 6362 6362 6362 6362 6362
The capacity to estimate GGMs is currently built on the frequentist version of MSFA (De Vito et al., 2019). This does not support high-dimensional data, so it is necessary to first reduce the number of predictors to smaller than the sample size. An area of future work is to extend the Bayesian version of MSFA (De Vito et al., 2021) in a similar way; this would enable graphical modeling for high-dimensional data. In the meantime, for illustration of the low-dimensional cases, we select the 1 percent genes with highest average variability across the studies.
geneVars = sapply(data_breastCancer05,function(x){apply(x,2,var)})
avgGeneVars = apply(geneVars,1,mean)
topVarIdx = which(avgGeneVars > quantile(avgGeneVars,0.99))
topVarGenes = names(avgGeneVars)[topVarIdx]Next, we filter the breast cancer datasets to only these genes and verify that all datasets are now low-dimensional.
X_s = lapply(data_breastCancer05,function(x){x[,colnames(x) %in% topVarGenes]})
sapply(X_s,dim)## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 118 353 200 99 198 133 344
## [2,] 64 64 64 64 64 64 64
Determining the number of factors
Before running MSFA, it is necessary to estimate the number of shared
and study-specific factors. We provide the function
get_factor_count for this purpose. This process takes about
10 minutes to run on a standard Macbook Pro, so we have commented it out
here and load the object instead.
# nFactors = get_factor_count(X_s, method="cng")
# save(nFactors,file="nFactors.rda")
load("nFactors.rda")The result contains the estimated number of shared factors:
nFactors$k## [1] 2
and the estimated number of study-specific factors:
nFactors$j_s## [1] 1 1 1 1 1 1 1
We use these numbers as input to MSFA-X. The first step is to
estimate a starting point for the expectation-conditional maximization
(ECM) algorithm used in MSFA, applying the start_msfa
function from the original MSFA code.
start_point = start_msfa(X_s = X_s, k = nFactors$k, j_s = nFactors$j_s)Running MSFA-X
Next, we run the extended MSFA algorithm (MSFA-X). This algorithm
will first run the ECM algorithm described in (De Vito et al.,
2019). Next, it will estimate the graphical model parameters
post hoc. The argument extend = TRUE is what tells MSFA to
perform this second step.
msfax_model = ecm_msfa(X_s = X_s, start = start_point, extend = TRUE)## i= 1000 Criterion for convergence 0.001324233
## i= 2000 Criterion for convergence 0.0005082112
## i= 3000 Criterion for convergence 0.001825515
## i= 4000 Criterion for convergence 0.0004093674
## i= 5000 Criterion for convergence 9.111513e-05
## i= 6000 Criterion for convergence 2.034439e-05
## i= 7000 Criterion for convergence 4.555826e-06
## i= 8000 Criterion for convergence 1.021952e-06
## i= 9000 Criterion for convergence 2.294291e-07
We can visually investigate the parameters of the MSFA model using
the corrplot package. We see that the two shared loadings
Phi1 and Phi2 load on different sets of genes
and that the study-specific loadings Lambda*1 are generally
somewhat similar, with the exception of Study 1.
allFactors = cbind(msfax_model$Phi, Reduce(cbind, msfax_model$Lambda_s))
rownames(allFactors) = colnames(X_s[[1]])
colnames(allFactors) = c("Phi1","Phi2","Lambda11","Lambda21","Lambda31","Lambda41","Lambda51","Lambda61","Lambda71")
corrplot(allFactors,is.corr=F)Creating graphs from MSFA-X output
Next, we extract the adjacency matrices defining the GGMs estimated by MSFA-X. Shared and study-specific precision matrices are converted to GGMs using the relationship between the \(i,j\) entry of the precision matrix \(\Theta\) and the partial correlation \(\rho_{ij}\) (Shutta, De Vito, et al., 2022):
\[ \rho_{ij} = -\frac{\Theta_{ij}}{\sqrt{\Theta_{ii}\Theta_{jj}}} \]
This can be done efficiently using the cov2cor function
in R:
shared_adjmat = -cov2cor(msfax_model$SharedPrec)
colnames(shared_adjmat) = colnames(msfax_model$X_s[[1]])
study_adjmats = lapply(msfax_model$StudyPrec,function(x){adjmat = -cov2cor(x); colnames(adjmat) = colnames(msfax_model$X_s[[1]]); adjmat})Next, we will use the igraph R package to convert these
matrices into graph objects:
shared_graph = graph_from_adjacency_matrix(shared_adjmat,mode="undirected",diag=F,weighted=T)
study_graphs = lapply(study_adjmats, graph_from_adjacency_matrix,mode="undirected",diag=F,weighted=T)We can visualize these graphs by using the plot function
of igraph. There are many layout options (see
?plot.igraph), but the easiest one for comparing graphs is
the circular layout because it preserves the ordering of the nodes in
every case.
msfaxPlot = function(my_graph, my_title="")
{
plot(my_graph,
vertex.size=2,
vertex.color="white",
vertex.label.color="black",
edge.width = 5*abs(E(my_graph)$weight),
edge.color = ifelse(E(my_graph)$weight > 0, "red","blue"),
layout = layout_in_circle)
title(my_title)
}
msfaxPlot(shared_graph, my_title="Shared")for(i in 1:7)
msfaxPlot(study_graphs[[i]], my_title=paste("Study",i))